Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator.
Hacker News is community driven: articles are posted and upvoted by users, and the most popular articles make it to the top of the site. The algorithm that ranks the articles is heavily influenced by time, giving new articles a significant boost. After all, it's a news site.
Content that can be submitted to Hacker News is defined as "anything that gratifies one's intellectual curiosity". If you haven't seen the site yet, check it out. You can find more details about the Hacker News algorithm in this blog post.
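To get a feel for why time matters so much, here is a minimal sketch of the commonly cited approximation of the ranking formula. This is not the official implementation (which also applies various penalties): a story's points are divided by a power of its age, so its rank decays as the story gets older.
In [ ]:
# commonly cited approximation of the Hacker News ranking formula (not the official
# implementation); gravity controls how fast older stories fall down the ranking
def approximate_hn_rank(points, age_hours, gravity=1.8):
    return (points - 1) / ((age_hours + 2) ** gravity)

# a fresh story with few points can outrank an older story with many points
print(approximate_hn_rank(10, 1))    # ~1.25
print(approximate_hn_rank(100, 24))  # ~0.28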
If you're already an avid reader of Hacker News, you might have wondered: "Is there a topic commonality between the popular Hacker News stories?" Or, put differently, if you're an author: "What topic do I have to write about to make it to the top of Hacker News?"
This notebook tries to answer these questions using the Hacker News API and the AlchemyAPI to get ranked concepts for textual data. Here are the steps we'll take:
1. Query the Hacker News API for the current top stories and save their ids to disk.
2. Query the AlchemyAPI concept tagging feature with the URL of each story.
3. Aggregate popularity measures (score and number of descendants) per extracted concept.
4. Show the aggregated concepts in tabular form, sorted by the different popularity measures.
To run this notebook, register for a free AlchemyAPI account. After you receive your API key, paste your API key in the cell below:
In [1]:
api_key = 'PASTE_ALCHEMY_API_KEY_HERE'
Import required Python libraries:
In [2]:
import requests
import os
import pandas as pd
from datetime import datetime
The Hacker News API provides an endpoint topstories that returns the 500 most highly rated stories at the time of the request.
In [3]:
hacker_news_api_base_url = 'https://hacker-news.firebaseio.com/v0/'
hacker_news_feature_url_item = 'item/'
hacker_news_feature_url_topstories = 'topstories'
hacker_news_api_parameters = '.json?print=pretty'
In [4]:
def get_story_for_id(story_id):
    ''' Queries the Hacker News API for story information for the given story_id. '''
story_request_url = hacker_news_api_base_url + hacker_news_feature_url_item + unicode(story_id) + hacker_news_api_parameters
story = requests.get(story_request_url).json()
return story
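As a quick sanity check you can call the helper with any existing story id. The story id below is just an example; according to the Hacker News API documentation, a story item contains fields such as by, descendants, id, kids, score, time, title, type, and url.
In [ ]:
# example usage with an arbitrary existing story id (8863 is used purely as an illustration)
sample_story = get_story_for_id(8863)
print(sample_story.get('title'))
print(sample_story.get('score'))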
In [5]:
def get_story_details(story):
''' Filter relevant story information from the given Hacker News API story object. '''
    # remove the comment ids ('kids') from the story, because we don't use them
if 'kids' in story: del story['kids']
# encode text field content as ascii (work around IPython defect https://github.com/ipython/ipython/issues/6799)
if 'title' in story: story['title'] = story['title'].encode('ascii', 'ignore')
if 'text' in story: story['text'] = story['text'].encode('ascii', 'ignore')
if 'url' in story: story['url'] = story['url'].encode('ascii', 'ignore')
return story
In [6]:
def get_all_story_details(story_ids):
''' Queries Hacker News API for relevant story information for given list of story_ids. '''
all_story_details = []
for story_id in story_ids:
all_story_details.append(get_story_details(get_story_for_id(story_id)))
return all_story_details
In [7]:
current_top_500_stories_url = hacker_news_api_base_url + hacker_news_feature_url_topstories + hacker_news_api_parameters
current_top_500_stories = requests.get(current_top_500_stories_url).json()
Take a look at what we got to make sure we have a list of story ids:
In [ ]:
current_top_500_stories
The top 500 stories provided by the Hacker News API are a snapshot that reflects the currently most popular stories. To enable an analysis of the most popular stories over time, it is helpful to work with a larger corpus of stories.
Let's save the story ids to disk for future use, using the Pandas convenience methods read_pickle and to_pickle, which wrap the Python pickle library:
In [9]:
story_ids_file_name = 'hacker_news_story_ids.pickle'
def update_saved_story_ids(story_ids, story_ids_file_name):
''' Read story ids from disk, merge with given story_ids, and save back to disk. '''
file_story_ids = []
try:
file_story_ids = pd.read_pickle(story_ids_file_name)
except IOError as err:
# file for story ids does not yet exist, move on
pass
merged_story_ids = set(file_story_ids).union(set(story_ids))
pd.Series(list(merged_story_ids)).to_pickle(story_ids_file_name)
return merged_story_ids
In [10]:
story_ids_up_until_today = update_saved_story_ids(current_top_500_stories, story_ids_file_name)
In [11]:
len(story_ids_up_until_today)
Out[11]:
Now, query the details and show a sample of the first five stories (if you happen to hit JSON errors, try running the cell again; they seem to occur intermittently):
In [14]:
all_story_details = get_all_story_details(list(story_ids_up_until_today))
# optionally, comment out the first line and uncomment the two lines below to work with a subset of stories and reduce subsequent requests against AlchemyAPI
# top_10_stories = list(story_ids_up_until_today)[0:10]
# all_story_details = get_all_story_details(top_10_stories)
stories_df = pd.DataFrame.from_dict(all_story_details)
stories_df.head(5)
Out[14]:
One of the features provided by AlchemyAPI is Concept Tagging. It allows extracting concepts from web-based content available at a given URL. We're going to apply concept tagging to the URLs from the Hacker News stories.
In [15]:
alchemy_api_base_url = 'http://access.alchemyapi.com/calls/url/'
alchemy_api_parameters = '?apikey=' + api_key + '&outputMode=json&url='
alchemy_feature_url_concepts = "URLGetRankedConcepts"
In [16]:
def get_concepts_for_url(story_url, story_urls_and_concepts):
''' Query AlchemyAPI concept tagging for given url and add result to given story_urls_and_concepts dictionary. '''
if story_url in story_urls_and_concepts:
# attempt to get concepts for story url from disk
concepts = story_urls_and_concepts.get(story_url)
else:
        # no concepts cached for this story url, query AlchemyAPI for concepts and save them for future use
request_url = alchemy_api_base_url + alchemy_feature_url_concepts + alchemy_api_parameters + story_url
concepts = requests.get(request_url).json()
story_urls_and_concepts[story_url] = concepts
return concepts
Let's test the function by running it against a test_url
pointing to an article from cnn.com:
In [17]:
story_urls_and_concepts = {}
test_url = 'http://www.cnn.com/2009/CRIME/01/13/missing.pilot/index.html'
get_concepts_for_url(test_url, story_urls_and_concepts)
Out[17]:
You should see a JSON document containing a list of concepts extracted from the website at the given url. Each concept is identified by its text and is assigned a relevance, which measures how confident AlchemyAPI is that the website is talking about this concept. Based on the identified concept, the JSON also contains links to publicly available knowledge bases such as DBpedia and YAGO. Feel free to test AlchemyAPI concept tagging for articles that you're interested in by replacing the test_url.
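For orientation, the response has roughly the following shape; the values below are made up for illustration and only the fields discussed above are shown:
In [ ]:
# illustrative structure of a concept tagging response (values are invented)
example_concept_response = {
    'status': 'OK',
    'url': 'http://www.cnn.com/2009/CRIME/01/13/missing.pilot/index.html',
    'concepts': [
        {
            'text': 'Aircraft',      # the identified concept
            'relevance': '0.91',     # confidence that the page is about this concept
            'dbpedia': 'http://dbpedia.org/resource/Aircraft'  # link into a knowledge base
        }
        # ... more concepts ...
    ]
}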
The free pricing tier for AlchemyAPI allows 1000 queries per day. To reduce the number of AlchemyAPI requests, we create a dictionary story_urls_and_concepts that stores each story url together with its list of detected concepts:
In [26]:
story_urls_and_concepts_file_name = 'story_urls_and_concepts.pickle'
try:
story_urls_and_concepts = pd.read_pickle(story_urls_and_concepts_file_name)
except IOError as err:
# file for story urls and concepts does not yet exist, move on
story_urls_and_concepts = {}
pass
Now that we can extract concepts for the article at a given url, we need to extract the concepts for a Hacker News story_id. For each identified concept we need to keep track of how often and from which stories it was extracted, and we need to aggregate popularity measures like score and number of descendants across those stories.
The resulting data structure is a dictionary of dictionaries containing the following information:
{
    'Programming language': {
        'occurs': 11,        # there are 11 occurrences of the concept 'Programming language' in our stories
        'score': 543,        # aggregated score of all stories containing the concept 'Programming language'
        'ids': [123, 456],   # story ids of all stories containing the concept 'Programming language'
        'descendants': 94,   # aggregated number of descendants of all stories containing 'Programming language'
        'links': ['www.cnn.com/programming_language', ...]   # links to all stories about 'Programming language'
    }
}
The following function aggregates all this information about all concepts extracted from all stories:
In [27]:
def get_concepts_for_id(story_id, all_concepts_dicts, story_urls_and_concepts):
''' Extracts concepts for given story_id and aggregates story popularity information. '''
print "Querying concepts for story " + unicode(story_id) + "..."
request_url = hacker_news_api_base_url + hacker_news_feature_url_item + unicode(story_id) + hacker_news_api_parameters
print(request_url)
story = requests.get(request_url).json()
# ignore "Ask HN" and job posts, only consider actual stories
if story.get('type') == 'story':
# make sure story has url that links to article
if story.get('url') is not None:
# extract concepts using AlchemyAPI
concept_result = get_concepts_for_url(story.get('url'), story_urls_and_concepts)
if concept_result['status'] == 'OK':
concepts = concept_result.get('concepts')
for concept in concepts:
# check, if we previously encountered the concept in another article
concept_dict = {}
concept_text = concept.get('text')
                    # ignore concepts with low relevance
if (float(concept.get('relevance')) > 0.6):
concept_dict['occurs'] = 1
concept_dict['relevance'] = concept.get('relevance')
concept_dict['ids'] = [story_id]
concept_dict['score'] = story.get('score')
concept_dict['descendants'] = story.get('descendants')
concept_dict['links'] = [story.get('url')]
if concept_text in all_concepts_dicts:
# merge additional concept info with already existing concept info
# add up the scores and number of descendants by concept
already_existing_concept = all_concepts_dicts.get(concept_text)
already_existing_concept['occurs'] = already_existing_concept['occurs'] + 1
already_existing_concept['score'] = already_existing_concept['score'] + story.get('score')
already_existing_concept['descendants'] = already_existing_concept['descendants'] + story.get('descendants')
already_existing_concept['links'] = already_existing_concept['links'] + concept_dict['links']
already_existing_concept['ids'] = already_existing_concept['ids'] + concept_dict['ids']
else:
all_concepts_dicts[concept_text] = concept_dict
return all_concepts_dicts
Let's test the get_concepts_for_id
helper function by providing it a valid Hacker News story id:
In [28]:
all_concepts_dicts = {}
test_story_id = 9226497
all_concepts_dicts = get_concepts_for_id(test_story_id, all_concepts_dicts, story_urls_and_concepts)
print all_concepts_dicts
You should see a dictionary of concepts with links to stories, score, and descendant information. Feel free to try different test_story_id values.
Now that we know our get_concepts_for_id
function works, let's query and aggregate the concepts for all Hacker News stories:
In [29]:
len(story_urls_and_concepts)
Out[29]:
In [30]:
len(stories_df)
Out[30]:
In [ ]:
story_counter = 1
for story_id in story_ids_up_until_today:
# optionally, comment the line above and uncomment the line below to limit requests to 10 stories
# for story_id in top_10_stories:
all_concepts_dicts = get_concepts_for_id(story_id, all_concepts_dicts, story_urls_and_concepts)
print 'Done. ' + unicode(story_counter) + ' stories queried.'
story_counter = story_counter + 1
Save the story_urls_and_concepts dictionary to disk for future use. The dictionary was created and updated while iterating through all story_ids and is valuable at this point, because it contains the concepts returned by the AlchemyAPI, which only supports a limited number of requests per day. Without saving story_urls_and_concepts to disk, we would hit that 1000 requests per day limit after just a few days of collecting story ids.
In [33]:
import pickle
with open(story_urls_and_concepts_file_name, 'wb') as story_urls_and_concepts_file:
pickle.dump(story_urls_and_concepts, story_urls_and_concepts_file)
In [34]:
all_concepts_df = pd.DataFrame.from_dict(all_concepts_dicts, orient='index')
all_concepts_sorted_by_score_df = all_concepts_df.sort(columns='score', ascending=False)
all_concepts_sorted_by_score_df
Out[34]:
In [35]:
all_concepts_sorted_by_descendants_df = all_concepts_df.sort(columns='descendants', ascending=False)
all_concepts_sorted_by_descendants_df
Out[35]:
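Note that sorting by the aggregated score favors concepts that simply occur in many stories. As an optional sketch (reusing the columns built above), you can divide the aggregated score by the number of occurrences to get an average score per story before sorting:
In [ ]:
# optional: average score per occurrence, to reduce the bias towards concepts
# that merely appear in many stories
all_concepts_df['score_per_story'] = all_concepts_df['score'] / all_concepts_df['occurs']
all_concepts_df.sort(columns='score_per_story', ascending=False).head(10)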
You can be the judge whether these are the topics you would have expected to be on top. As you run this notebook over time, more stories will be available. You can also run this Hacker News and AlchemyAPI.ipynb notebook on a recurring basis, for example once a day, by running the notebook Hacker News Runner.ipynb. This will aggregate data over time and allow for more detailed analysis.
In this notebook we showed the usage of the Hacker News API and the AlchemyAPI. We used the AlchemyAPI concept tagging feature to extract topics from the Hacker News stories. Finally, we aggregated popularity information about the stories for each concept and showed it in tabular form, sorted by different popularity measures.
The invocation of the AlchemyAPI was rather simple. A lot of the code that writes intermediate results to disk exists only to work around the 1000 requests per day limitation.
Concept detection is only one feature of the AlchemyAPI. Check out more features in the API documentation.
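The other URL-based features follow the same calling pattern, so switching the feature part of the request URL is all it takes. The endpoint name below (URLGetRankedKeywords for keyword extraction) is assumed from that naming scheme, so double-check it against the documentation:
In [ ]:
# sketch: reuse the same URL pattern for a different AlchemyAPI feature
# (endpoint name assumed from the documented naming scheme; verify in the API docs)
alchemy_feature_url_keywords = 'URLGetRankedKeywords'
keywords_request_url = alchemy_api_base_url + alchemy_feature_url_keywords + alchemy_api_parameters + test_url
keywords_result = requests.get(keywords_request_url).json()
print(keywords_result.get('status'))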